data discovery
From keywords to semantics: Perceptions of large language models in data discovery
Halstead, Maura E, Green, Mark A., Jay, Caroline, Kingston, Richard, Topping, David, Singleton, Alexander
This matching requires researchers to know the exact wording that other researchers previously used, creating a challenging process that could lead to missing relevant data. Large Language Models (LLMs) could enhance data discovery by removing this requirement and allowing researchers to ask questions with natural language. However, we do not currently know if researchers would accept LLMs for data discovery. Using a human-centered artificial intelligence (HCAI) focus, we ran focus groups (N = 27) to understand researchers' perspectives towards LLMs for data discovery. Our conceptual model shows that the potential benefits are not enough for researchers to use LLMs instead of current technology. Barriers prevent researchers from fully accepting LLMs, but features around transparency could overcome them. Using our model will allow developers to incorporate features that result in an increased acceptance of LLMs for data discovery.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Merseyside > Liverpool (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Research Report > Experimental Study (0.95)
- Research Report > New Finding (0.70)
- Health & Medicine (1.00)
- Government (1.00)
- Information Technology (0.68)
- Education > Educational Setting (0.46)
LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science
Salemi, Alireza, Parmar, Mihir, Goyal, Palash, Song, Yiwen, Yoon, Jinsung, Zamani, Hamed, Palangi, Hamid, Pfister, Tomas
The rapid advancement of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large heterogeneous data lakes. Existing methods struggle with this: single-agent systems are quickly overwhelmed by large, heterogeneous files in the large data lakes, while multi-agent systems designed based on a master-slave paradigm depend on a rigid central controller for task allocation that requires precise knowledge of each sub-agent's capabilities. To address these limitations, we propose a novel multi-agent communication paradigm inspired by the blackboard architecture for traditional AI models. In this framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents -- either responsible for a partition of the data lake or general information retrieval -- volunteer to respond based on their capabilities. This design improves scalability and flexibility by eliminating the need for a central coordinator to have prior knowledge of all sub-agents' expertise. We evaluate our method on three benchmarks that require explicit data discovery: KramaBench and modified versions of DS-Bench and DA-Code to incorporate data discovery. Experimental results demonstrate that the blackboard architecture substantially outperforms baselines, including RAG and the master-slave multi-agent paradigm, achieving between 13% and 57% relative improvement in end-to-end task success and up to a 9% relative gain in F1 score for data discovery over the best-performing baselines across both proprietary and open-source LLMs. Our findings establish the blackboard paradigm as a scalable and generalizable communication framework for multi-agent systems.
- Europe > Austria > Vienna (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Information Technology (0.46)
- Health & Medicine (0.34)
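The posting-and-volunteering pattern described in the abstract above can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: the class names (`Blackboard`, `PartitionAgent`), the file names, and the descriptions are all hypothetical, and a simple keyword-overlap check stands in for the LLM-based judgement an agent would use to decide whether it can help.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Shared board: the main agent posts requests, helper agents post responses."""
    requests: list = field(default_factory=list)
    responses: dict = field(default_factory=dict)

    def post(self, request: str) -> int:
        self.requests.append(request)
        return len(self.requests) - 1  # index of the request acts as its id

    def respond(self, request_id: int, agent: str, answer: list) -> None:
        self.responses.setdefault(request_id, []).append((agent, answer))


@dataclass
class PartitionAgent:
    """Helper agent responsible for one (hypothetical) partition of the data lake."""
    name: str
    files: dict  # file name -> short description

    def matching_files(self, request: str) -> list:
        # Assumption: keyword overlap replaces the LLM's relevance judgement.
        words = set(request.lower().split())
        return sorted(
            f for f, desc in self.files.items()
            if words & set(desc.lower().split())
        )

    def maybe_volunteer(self, board: Blackboard, request_id: int) -> None:
        # Each agent decides for itself whether to respond; nothing assigns the task.
        hits = self.matching_files(board.requests[request_id])
        if hits:
            board.respond(request_id, self.name, hits)


board = Blackboard()
agents = [
    PartitionAgent("weather", {"rain_2020.csv": "daily rainfall for each city"}),
    PartitionAgent("finance", {"prices.csv": "closing stock prices"}),
]
rid = board.post("city rainfall totals")
for agent in agents:
    agent.maybe_volunteer(board, rid)
print(board.responses.get(rid, []))
```

The point of the design is that the posting agent never needs a registry of what each helper knows: it writes to the board, and only the helpers that consider themselves relevant write back.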
Impact and influence of modern AI in metadata management
Yang, Wenli, Fu, Rui, Amin, Muhammad Bilal, Kang, Byeong
Metadata management plays a critical role in data governance, resource discovery, and decision-making in the data-driven era. While traditional metadata approaches have primarily focused on organization, classification, and resource reuse, the integration of modern artificial intelligence (AI) technologies has significantly transformed these processes. This paper investigates both traditional and AI-driven metadata approaches by examining open-source solutions, commercial tools, and research initiatives. A comparative analysis of traditional and AI-driven metadata management methods is provided, highlighting existing challenges and their impact on next-generation datasets. The paper also presents an innovative AI-assisted metadata management framework designed to address these challenges. This framework leverages more advanced modern AI technologies to automate metadata generation, enhance governance, and improve the accessibility and usability of modern datasets. Finally, the paper outlines future directions for research and development, proposing opportunities to further advance metadata management in the context of AI-driven innovation and complex datasets.
- Oceania > Australia > Tasmania (0.04)
- Europe > United Kingdom (0.04)
- Europe > Germany > Saxony > Leipzig (0.04)
- Research Report (1.00)
- Overview > Innovation (0.34)
- Law (1.00)
- Information Technology > Services (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
GNN: Graph Neural Network and Large Language Model for Data Discovery
Our algorithm GNN: Graph Neural Network and Large Language Model for Data Discovery inherits the benefits of \cite{hoang2024plod} (PLOD: Predictive Learning Optimal Data Discovery) and \cite{Hoang2024BODBO} (BOD: Blindly Optimal Data Discovery) in overcoming the need to predefine a utility function and to rely on human input for attribute ranking, which helps avoid a time-consuming iterative process. In addition to these previous works, our algorithm GNN leverages the advantages of graph neural networks and large language models to understand text-type values that cannot be understood by PLOD and MOD, thus making the task of predicting outcomes more reliable. GNN can be seen as an extension of PLOD that understands the user's preferences over text values as well as numerical values, which makes it promising for data science and analytics purposes.
- North America > United States > Ohio (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China (0.04)
CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems
Feng, Yanlin, Rahman, Sajjadur, Feng, Aaron, Chen, Vincent, Kandogan, Eser
Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks via interactions with tools and data retrievers have garnered significant interest within database and AI communities. While these systems have the potential to supplement typical analysis workflows of data analysts in enterprise data platforms, unfortunately, CASs are subject to the same data discovery challenges that analysts have encountered over the years -- silos of multimodal data sources, created across teams and departments within an organization, make it difficult to identify appropriate data sources for accomplishing the task at hand. Existing data discovery benchmarks do not model such multimodality and multiplicity of data sources. Moreover, benchmarks of CASs prioritize only evaluating end-to-end task performance. To catalyze research on evaluating the data discovery performance of multimodal data retrievers in CASs within a real-world setting, we propose CMDBench, a benchmark modeling the complexity of enterprise data platforms. We adapt existing datasets and benchmarks in open-domain -- from question answering and complex reasoning tasks to natural language querying over structured data -- to evaluate coarse- and fine-grained data discovery and task execution performance. Our experiments reveal the impact of data retriever design on downstream task performance -- a 46% drop in task accuracy on average -- across various modalities, data sources, and task difficulty. The results indicate the need to develop optimization strategies to identify appropriate LLM agents and retrievers for efficient execution of CASs over enterprise data.
- Asia > Japan (0.04)
- North America > United States > Utah (0.04)
- North America > United States > Minnesota (0.04)
- Information Technology > Data Science > Data Mining > Big Data (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
METAM: Goal-Oriented Data Discovery
Galhotra, Sainyam, Gong, Yue, Fernandez, Raul Castro
Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes and data marketplaces creates an opportunity to augment data and boost those tasks' performance. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing solutions do not leverage the synergy between discovery and augmentation, thus underexploiting data. In this paper, we introduce METAM, a novel goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. To select candidates efficiently, METAM leverages properties of the i) data, ii) utility function, and iii) solution set size. We show METAM's theoretical guarantees and demonstrate those empirically on a broad set of tasks. All in all, we demonstrate the promise of goal-oriented data discovery to modern data science applications.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Middle East > Cyprus > Nicosia > Nicosia (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Education (0.93)
- Banking & Finance > Real Estate (0.46)
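The feedback loop mentioned in the METAM abstract above, where the downstream task is queried with a candidate dataset and the answer steers what is tried next, can be illustrated with a plain greedy search. This is a simplified sketch, not METAM's actual algorithm: the function `greedy_discovery`, the candidate names, and the utility gains are all made up for illustration, and none of the paper's candidate-selection properties or guarantees are modelled.

```python
from typing import Callable, List, Tuple


def greedy_discovery(
    candidates: List[str],
    utility: Callable[[List[str]], float],
    max_size: int,
) -> Tuple[List[str], float]:
    """Add one candidate augmentation at a time while the task utility improves."""
    selected: List[str] = []
    best = utility(selected)  # baseline: downstream task with no augmentation
    while len(selected) < max_size:
        # Query the downstream task once per remaining candidate.
        scored = [
            (utility(selected + [c]), c)
            for c in candidates
            if c not in selected
        ]
        if not scored:
            break
        top_score, top = max(scored)
        if top_score <= best:
            break  # no remaining candidate helps, so stop early
        selected.append(top)
        best = top_score
    return selected, best


# Hypothetical per-candidate gains standing in for a real model evaluation.
GAINS = {"zip_income": 0.10, "weather": 0.03, "random_id": 0.0}


def toy_utility(columns: List[str]) -> float:
    return 0.60 + sum(GAINS[c] for c in columns)


selected, best = greedy_discovery(list(GAINS), toy_utility, max_size=3)
print(selected, round(best, 2))
```

In this toy run the loop stops before reaching `max_size` because the last candidate adds nothing, which mirrors the idea that the task itself, rather than a user's manual shortlist, decides which augmentations are worth keeping.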
Computer scientist, Data scientist or similar with a focus on knowledge management (f/m/x) - Data Discovery for Anonymised Health Data
The focus of the DLR Institute for Data Science in Jena is to find solutions for the major challenges of the digitalisation age. The research focuses on the areas of data extraction and mobilisation, data management and preparation, and data analysis and intelligence. The position is part of the BMBF project Avatar (anonymisation of personal health data by creating virtual avatars). Topics include, in particular, the semantic modelling of relevant metadata and data discovery. The overall goal of the project is providing anonymised health data for both academic and commercial research.
Data Discovery for ML Engineers / DataScienceCentral.com
Real-world production ML systems consist of two main components: data and code. Data is clearly the leader, and rapidly taking center stage. Data defines the quality of almost any ML-based product, more so than code or any other aspect. In Feature Store as a Foundation for Machine Learning, we have discussed how feature stores are an integral part of the machine learning workflow. They improve the ROI of data engineering, reduce cost per model, and accelerate model-to-market by simplifying feature definition and extraction.
Exclusive Interview with Naren Vijay, EVP of Lumenore
Organizational intelligence (OI) is the capability of an organization to comprehend and create knowledge relevant to its purpose. In other words, it is the intellectual capacity of the entire organization. Lumenore is a powerful, intuitive, and cloud-based BI and analytics platform that delivers organizational intelligence by sifting data from any business application. Analytics Insight has engaged in an exclusive interview with Naren Vijay, EVP of Lumenore.
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.79)
How advanced AI tools can give organisations a holistic understanding of their data and improve compliance
It doesn't generate revenue, but it is an essential part of operating effectively as a business today. Whether it's industry-specific regulations or the standout regulation of our time, GDPR, we are all acutely aware of the damage, both reputational and financial, that non-compliance can cause. GDPR has equipped employees across industries with an appreciation of the context, usage, and security of data, but there is another factor that is essential for establishing an effective data strategy: data discoverability. To ensure regulatory compliance, data must not only be secure but also discoverable, so that compliance personnel can locate all information needed to prove compliance. Increasingly, AI tools are being harnessed to automate workflows and governance, but such capabilities can only be delivered when a strong data foundation is in place.
- Law (0.92)
- Information Technology > Security & Privacy (0.76)